<HTML>
<BODY BGCOLOR=white>
<PRE>
<!-- Manpage converted by man2html 3.0.1 -->
NAME
     Sun Grid Engine Checkpointing - the Sun Grid  Engine  check-
     pointing mechanism and checkpointing support

DESCRIPTION
     Sun Grid Engine supports two levels  of  checkpointing:  the
     user  level  and  a  operating  system  provided transparent
     level. User  level  checkpointing  refers  to  applications,
     which do their own checkpointing by writing restart files at
     certain times or algorithmic steps and by properly  process-
     ing these restart files when restarted.

     Transparent checkpointing has to be provided by the  operat-
     ing system and is usually integrated in the operating system
     kernel. An example for  a  kernel  integrated  checkpointing
     facility is the Hibernator package from Softway for SGI IRIX
     platforms.

     Checkpointing jobs need to be identified  to  the  Sun  Grid
     Engine  system by using the -<I>ckpt</I> option of the <I>qsub1</I>() com-
     mand. The argument to this flag refers to a so called check-
     pointing  environment,  which  defines the attributes of the
     checkpointing method  to  be  used  (see  <I>checkpoint5</I>()  for
     details).   Checkpointing  environments  are  setup  by  the
     <I>qconf1</I>() options -<I>ackpt</I>,  -<I>dckpt</I>,  -<I>mckpt</I>  and  -<I>sckpt</I>.  The
     <I>qsub1</I>()  option  -<I>c</I> can be used to overwrite the <I>when</I> attri-
     bute for the referenced checkpointing environment.

     If a queue is of the type CHECKPOINTING, jobs need  to  have
     the checkpointing attribute flagged (see the -ckpt option to
     <I>qsub1</I>()) to be permitted to run in such a queue. As  opposed
     to  the  behavior for regular batch jobs, checkpointing jobs
     are aborted under conditions, for which batch or interactive
     jobs are suspended or even stay unaffected. These conditions
     are:

     <B>o</B>  Explicit suspension of the queue or job  via  <I>qmod1</I>()  by
        the  cluster  administration  or  a  queue owner if the <I>x</I>
        occasion specifier (see <I>qsub1</I>() -<I>c</I> and <I>checkpoint5</I>()) was
        assigned to the job.

     <B>o</B>  A load average value exceeding the suspend  threshold  as
        configured    for    the    corresponding   queues   (see
        <I>queue</I>_<I>conf5</I>().)

     <B>o</B>  Shutdown  of  the  Sun  Grid  Engine   execution   daemon
        <I>sge</I>_<I>execd8</I>() being responsible for the checkpointing job.

     After abortion, the jobs will migrate to other queues unless
     they  were  submitted  to  one specific queue by an explicit
     user request.  The migration of jobs leads to a dynamic load
     balancing.   Note:  The  abortion  of checkpointed jobs will
     free all resources (memory, swap space) which the job  occu-
     pies  at  that  time.  This  is opposed to the situation for
     suspended regular jobs, which still cover swap space.

RESTRICTIONS
     When a job migrates to a queue on another machine at present
     no files are transferred automatically to that machine. This
     means that all files which are used  throughout  the  entire
     job  including  restart files, executables and scratch files
     must be visible  or  transferred  explicitly  (e.g.  at  the
     beginning of the job script).

     There are also some practical limitations regarding  use  of
     disk space for transparently checkpointing jobs. Checkpoints
     of a  transparently  checkpointed  application  are  usually
     stored  in  a  checkpoint file or directory by the operating
     system. The file or directory contains all the  text,  data,
     and  stack space for the process, along with some additional
     control information. This means jobs which use a very  large
     virtual  address  space  will generate very large checkpoint
     files. Also the workstations on which the jobs will actually
     execute  may  have  little  free  disk space. Thus it is not
     always possible to transfer a transparent checkpointing  job
     to  a machine, even though that machine is idle. Since large
     virtual memory jobs must wait for a  machine  that  is  both
     idle,  and  has a sufficient amount of free disk space, such
     jobs may suffer long turnaround times.

SEE ALSO
     <I>sge</I>_<I>intro1</I>(,) <I>qconf1</I>(,) <I>qmod1</I>(,) <I>qsub1</I>(,) <I>checkpoint5</I>(,) <I>Sun</I>
     <I>Grid</I>  <I>Engine</I> <I>Installation</I> <I>and</I> <I>Administration</I> <I>Sun</I> <I>Grid</I> <I>Engine</I>
     <I>User</I>'<I>s</I> <I>Guide</I>

COPYRIGHT
     See <I>sge</I>_<I>intro1</I>() for a full statement of rights and  permis-
     sions.
















</PRE>
<HR>
<ADDRESS>
Man(1) output converted with
<a href="http://www.oac.uci.edu/indiv/ehood/man2html.html">man2html</a>
</ADDRESS>
</BODY>
</HTML>
